Key information extraction algorithm of news Web pages

XIANG Jingjing, GENG Guanggang, LI Xiaodong

Journal of Computer Applications 2016, 36 (8): 2082-2086. DOI: 10.11772/j.issn.1001-9081.2016.08.2082

Existing information extraction algorithms for Web pages lack generality and do not cover the title, release time and source of news Web pages. To resolve these problems, a new information extraction algorithm is proposed. First, the HTML code of a Web page is parsed into a set of entries, each pairing a line number with its text. The extractor then searches for the boundary of the news content, starting from the line that contains the longest sentence, because the longest sentence belongs to the news content with extremely high probability. The longest common string algorithm is used to extract the title, a regular expression combined with line numbers is used to extract the release time, and the presentation characteristics of the source combined with line numbers are used to extract the source. Finally, a data set was built to compare extraction accuracy against an open-source software package named newsPaper. Experimental results show that the proposed algorithm, newsExtractor, outperforms newsPaper in average accuracy for content, title, release time and source, and that it has strong generality and robustness.
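
The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' newsExtractor code; the abstract gives no implementation details, so every function name, the tag-stripping and date regular expressions, the `min_len` stopping rule for the content boundary, and the choice to compare the `<title>` text with lines above the content are assumptions made for illustration. "Longest common string" is read here as longest common substring. Source extraction is omitted because the abstract does not say which presentation characteristics are used.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
HIDDEN_RE = re.compile(r"<(title|script|style)\b.*?</\1>", re.I | re.S)
# Hypothetical date pattern; the abstract does not give the actual expression.
DATE_RE = re.compile(
    r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"
)


def html_to_lines(html):
    """Return (line_number, text) pairs for source lines with visible text."""
    # Blank out non-visible elements but keep their newlines,
    # so the remaining line numbers still match the original HTML.
    visible = HIDDEN_RE.sub(lambda m: "\n" * m.group(0).count("\n"), html)
    pairs = []
    for number, raw in enumerate(visible.splitlines(), start=1):
        text = TAG_RE.sub("", raw).strip()
        if text:
            pairs.append((number, text))
    return pairs


def content_bounds(pairs, min_len=40):
    """Grow a window around the longest text line while neighbours stay long.

    min_len is an assumed stopping rule, not taken from the paper.
    Returns (start, end) as inclusive indexes into pairs.
    """
    seed = max(range(len(pairs)), key=lambda i: len(pairs[i][1]))
    start = end = seed
    while start > 0 and len(pairs[start - 1][1]) >= min_len:
        start -= 1
    while end < len(pairs) - 1 and len(pairs[end + 1][1]) >= min_len:
        end += 1
    return start, end


def longest_common_substring(a, b):
    """Dynamic-programming longest common substring of a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def extract_title(html, pairs, content_start):
    """Match the <title> text against lines above the content (assumed pairing)."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    if not m:
        return ""
    page_title = m.group(1).strip()
    candidates = [longest_common_substring(page_title, text)
                  for _, text in pairs[:content_start]]
    return max(candidates, key=len, default="").strip()


def extract_release_time(pairs, content_start):
    """Return the date match on the line closest to the start of the content."""
    content_line = pairs[content_start][0]
    hits = [(abs(number - content_line), m.group(0))
            for number, text in pairs
            for m in [DATE_RE.search(text)] if m]
    return min(hits)[1] if hits else ""
```

A caller would run `html_to_lines` first, pass the result to `content_bounds`, and then hand the returned start index to `extract_title` and `extract_release_time`. Using line distance to pick among several date matches reflects the abstract's mention of line numbers, but the exact rule is a guess.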
